Finger vein recognition based on bilinear fusion of multiscale features

Finger veins are widely used in various fields due to their high safety. Existing finger vein recognition methods have some shortcomings, such as low recognition accuracy and large model size. To address these shortcomings, a multi-scale feature bilinear fusion network (MSFBF-Net) was designed. First, the network model extracts the global features and local detail features of the finger veins and performs linear fusion to obtain second-order features with richer information. Then, the mixed depthwise separable convolution replaces the ordinary convolution, which greatly reduces the computational complexity of the network model. Finally, a multiple attention mechanism (MAM) suitable for finger veins was designed, which can simultaneously extract the channel, spatial, directional, and positional information. The experimental results show that the method is very effective, and the accuracy of the two public finger vein databases is 99.90% and 99.82%, respectively.

The method can be combined with existing databases and recognition algorithms, which guarantees excellent performance of the algorithm, but the recognition time is too long, the recognition time of a single image is 9.2ms 11 . Zhang et al. proposed a joint Bayesian framework based on partial least squares discriminant analysis, which can effectively extract the texture and orientation features of finger veins. Experiments on public databases obtained high recognition accuracy and fast recognition speed. However, when converting the original features into a low-dimensional form, there may be a loss of features 12 . Noh et al. proposed a deep CNN method for finger vein recognition, which fused the texture features and shape features of finger veins, effectively reducing the impact of image noise on the recognition results. This method requires two input images which results in excessive processing time 13 . Hong et al. proposed a novel finger vein recognition network based on VGG16 and achieved excellent results. But the input to the network model needs to use the augmented image, the original unpreprocessed image, and the difference image obtained from the two finger vein images 14 . Yang et al. proposed the FV-GAN network model based on a generative adversarial network, which is the first time to apply a generative adversarial network to the field of finger veins and has strong robustness against outliers and blood vessels ruptures. But requires a large number of training samples 15 . Tang et al. proposed a pre-trained weighted convolutional neural network based on Siamese, which effectively improved the network performance. But the extraction of ROI is not considered 16 . Ren et al. proposed to apply the encryption algorithm to the finger vein recognition task, which greatly improves the security, but due to the need for encryption, it leads to additional time overhead 31 . It can be seen from the above analysis that the existing identification methods have achieved good results in recognition performance, but there are still some shortcomings in recognition time and computational complexity. The method in this paper comprehensively considers the four aspects of recognition accuracy, model parameters, computational complexity, and recognition time, and obtains good performance.
The methods based on deep learning have achieved better results without image preprocessing. However, the quality of the image is still an important factor affecting the performance of the network model. When the image quality is too poor or the database samples are too few, it is difficult for the network to extract the effective features of finger veins, so there is a disadvantage that the recognition accuracy is not high enough. Convolutional neural networks also have shortcomings such as many model parameters, and long training and recognition time. To solve these problems, we propose a multi-scale feature bilinearly fusion network combined with the multiple attention mechanism. Firstly, a multi-scale feature bilinear fusion network is designed. The shallow structure of the network model extracts the detailed features of finger veins, and the deep structure extracts the overall contour of finger veins. Then the shallow features and deep features are bilinear fused to obtain the secondorder features with richer information. Secondly, the lightweight method based on mixed depthwise separable convolution is adopted to reduce the parameters and computational complexity of the network, and speed up network model training and recognition. Finally, to further improve the recognition accuracy, a multiple attention mechanism module is designed and embedded into the network model to simultaneously extract the space, channel, direction, and position information of the finger vein images .

Proposed method
In this section, a novel end-to-end finger vein recognition network framework is proposed to extract finger vein features.
Multi-scale feature bilinear fusion network. Finger veins are composed of vein patterns and have relatively simple features, which results in a certain similarity between different finger veins. Recognition only by extracting vein pattern features will leads to recognition errors. To solve these problems, a multi-scale feature bilinear fusion network was designed. The network structure is shown in Fig. 1.
Take the Conv0 layer as an example, where 3 × 3 represents the size of the convolution kernel, and 64 represents the number of convolution kernels. Firstly, the shallow features of the finger vein image are extracted by two convolution layers. The shallow features include the grain details of the finger vein, and the shallow features are divided into two paths, one of which is reserved for classification. The other way continues to use two convolution layers to extract the deep features including outline of the finger vein from the shallow features. The retained shallow features are dimensionally increased by using a 1 × 1 convolution kernel, and bilinear fusion is performed with deep features to obtain second-order features containing rich information. The formula is as follows: where f 1 (p,i) ∈ R T×M and f 2 (p,i) ∈ R T×M is the deep feature and shallow feature of image i at position p. The deep feature and shallow feature are bilinearly fused to obtain matrix b, which contains the correlation of these two features.
Equation (2) uses sum pooling to aggregate bilinear features on b of all positions in image i, and the feature dimension is M × M. In the convolutional neural network, the fully connected layer is responsible for classifying the final features, while the fully connected layer can only accept one-dimensional feature vectors, so the features need to be flattened to obtain z ∈ R (M×M)×1 .  (4) is to perform L2 normalization on the vector x, where ||x|| 2 represents the norm of the vector x. The signed square root and L2 normalization operations are performed on the obtained one-dimensional features, which can improve the operation speed of the fully connected layer, avoid overfitting, and increase the network performance. Finally, the fully connected layer processes the feature y to complete the classification of finger vein images.
Since the finger vein features are relatively simple, the network model is not very deep, resulting in high efficiency. The multi-scale feature bilinear fusion network is divided into two branches. One branch locates the finger veins and extracts the shallow features (local features) of the finger veins. The other branch continues to process the shallow features and extracts the deep features (vein contours). The two-way features are bilinear fused to obtain second-order features, which contain more useful information. The global and local features of finger vein are fully mined, which makes the prediction and classification of finger vein recognition task more accurate.
Network lightweight. The convolutional neural network has a large number of parameters and computations, which ensures the powerful visual feature expression ability of the network model, but pays the cost of parameter redundancy, which will reduce the training and recognition speed. Because the finger vein recognition algorithm must operate in real-time on hardware with constrained resources, there are somewhat strict limitations on the size of the network model and the number of model parameters. It is required to create a lightweight network model in order to condense the model's size and satisfy the neural network's low-latency operating criteria. We adopted a shallower bilinear fusion network with deep multiscale features, allowing for a smaller number of parameters in the fully connected layer. The lightweight method in this paper is to directly optimize the convolutional layer and replace the conventional convolution with a mixed depthwise separable convolution to reduce the computational complexity of the network model.
The depthwise separable convolution decomposes the convolution process and divides the ordinary convolution into depthwise convolution and pointwise convolution, which greatly reduces the number of parameters and computations. Generally, the depthwise separable convolution uses a 3 × 3 convolution kernel, which is not suitable for finger vein recognition tasks. Due to the particularity of the structure of finger veins, the combination of large receptive field features and small receptive field features can obtain better results. The mixed depthwise separable convolution divides the feature map channels into groups, and each group uses depthwise separable convolutions with different kernel sizes, which can extract receptive field features of different sizes, so it is more suitable for finger vein recognition tasks. Figure 2 shows the convolution process of the mixed depthwise separable convolution. As can be seen from the figure, the mixed depthwise separable convolution receives input tensors, groups the channels of the tensors, and each group uses convolution kernels of different sizes. The features of different receptive fields can be extracted in the same convolution layer, and then different groups of features are combined together. The depthwise separable convolution is divided into two steps, which just separate feature extraction and feature combination. The tensors output in the feature extraction stage will be discharged according to the size of the receptive field, and the output tensors will be mixed in the feature combination stage, so there is no need to worry about the transmission of different receptive fields characteristics. The multi-scale feature bilinear network www.nature.com/scientificreports/ extracts the shallow features and deep features of the finger vein images. Therefore, the lightweight idea is to use a single set of 3 × 3 mixed depth separable convolutions in the shallow network to extract the contour features of the finger veins, and use 3 × 3 and 5 × 5 mixed depthwise separable convolutions in the deep network to extract the detailed features of different receptive fields of finger vein. To further reduce the computational complexity of the network model, the 5 × 5 convolution kernel is replaced with a 3 × 3 (dilation = 2) dilated convolution. The network structure is shown in Table 1.
Compared with ordinary convolution, the mixed depthwise separable convolution has significantly reduced the computational cost and the number of parameters, resulting in a better performance in terms of lightweight. However, depthwise separable convolution also has some shortcomings that cannot be ignored. The depthwise convolution performs independent convolution on each channel of the input tensor, ignoring the spatial location information. Pointwise convolution performs a weighted combination of input tensors in depth, which does not make efficient use of channel information. The combination of them separates the channel and space of the convolution, which affects the transfer of feature information. In response to this problem, a multiple attention mechanism module suitable for finger veins is designed, which will be described in detail in the next section.

Multiple attention mechanism.
For the finger vein recognition task, a network that can extract key features is very important. However, only relying on the basic network to extract finger vein features for prediction and classification, the results often fail to meet their expectations. Firstly, finger vein recognition demands high quality images, while collecting clear and noiseless high-quality finger vein image is always challenging. Secondly, A convolutional neural network will not locate the region of interest of the finger vein map like the human brain but process the whole image, which will extract many useless features. The process of feature classification is a waste of time and will be misled by these useless features, leading to classification errors. To overcome above two issues, The attention mechanism is adopted to save the energy wasted on useless information, and invest more in the key area in order to obtain detailed information in the area. Existing attention mechanisms are roughly divided into spatial attention mechanisms and channel attention mechanisms. If you want to use these two modules at the same time, it is also realized through a series connection, and the information may be lost in the transmission between modules. In view of the characteristics of finger veins and the shortcomings caused by lightweight processing, a multiple receptive field module is designed. The structure is shown in Fig. 3.
The multiple attention mechanism receives an input tensor X ∈ R C×H×W and outputs a tensor Y ∈ R C×H×W with feature enhancement capability. The multiple attention mechanism module is divided into multiple branches for parallel processing. First, the input tensor enters the module, and the first two branches are pooled by (H, 1) and (1, W) through X Avg Pool and Y Avg Pool respectively. Each channel encoding is checked so that features are aggregated along two spatial directions, capturing spatial interactions with location information. The output of channel c with height h and width w can be obtained as follows:  www.nature.com/scientificreports/ After helping the network model to find the region of interest of the finger vein, the two-way features are cascaded, and then the 1 × 1 size convolution is used to reduce the dimension of the features, which reduces the computational complexity. The formula is as follows: where F 1 represents a 1 × 1 size convolution operation, and δ represents non-linear activation function. Finally, f is divided into two independent tensors along the spatial dimension, and the sigmoid function is used to limit the features to 0-1 after using 1 × 1 convolution to increase the dimension. The final result is used as a weight parameter to enhance important features and weaken unimportant feature information: The first two branches complete the horizontal and vertical directions of channel attention and then introduce the spatial attention branch. After the input feature map is pooled, the channel dimension becomes 1, and the spatial information is aggregated to obtain average pooling feature F avg ∈ R 1×H×W and max pooling features F max ∈ R 1×H×W respectively. The two feature maps are then concatenated together, and a conventional convolution is used to generate an attention image with spatial information. The calculation method is as follows: where σ represents the sigmoid function and F 7 represents a 7 × 7 size convolution operation. The sigmoid function can assign corresponding weights according to the different degrees of information contribution at different positions. Equation (11) multiplies and sums multiple input features and the weights obtained by the corresponding attention mechanism, and then calculates the average value to obtain the final feature Sigmoid Sigmoid www.nature.com/scientificreports/ The multiple attention mechanism embeds the position information into the channel attention and combines with the spatial attention mechanism, which not only extracts the channel domain and spatial domain information, but also captures the directional and positional information of the finger veins, and performs attention operations in large areas. Improved network performance at a lower computational cost. The multiple attention mechanism assigns weights to different information on the finger vein image according to the degree of contribution, which is very suitable for finger vein recognition tasks and makes up for the deficiencies caused by depthwise separable convolution. The combination of a multi-scale feature bilinear fusion network with mixed depthwise separable convolution and multiple attention mechanism is the final model proposed for the finger vein recognition task in this paper. The network structure is shown in Fig. 4.

Experiments and analysis
Finger vein database. To evaluate the performance of the proposed network model from multiple aspects, we conduct experiments on two public finger vein databases, SDUMLA 17 and FV-USM 18 .
SDUMLA. SDUMLA considers the influence of translation, rotation, and other factors during acquisition, so the image quality is slightly worse. A total of 106 volunteers participated in the collection work. The index finger, middle finger, and ring finger of both hands were collected. Each finger captured six images. Finally, 636 finger vein categories and 3816 finger vein images were obtained, and the image size was 320 × 240.
FV-USM. The quality of vein image collection in FV-USM is high. 123 volunteers participated in the collection work. Six images of their left and right index fingers and middle fingers were collected respectively. The images were collected in two times, and six finger vein images were captured in each collection. Finally, 492 finger vein categories and 5904 finger vein images are obtained, and the image size is 640 × 480.
The public finger vein database takes into account differences in age, gender, and region of volunteers when collecting, and also considers the influence of non-ideal images such as angle, occlusion, and dark light. In order to facilitate the analysis and comparison of the experimental results, the database image size is adjusted to 60 × 180. The pictures in the public database are randomly divided according to the ratio of 1:5, one is the test database, and the five are the training database, to ensure that they are independent of each other. Comparison with classical network. As can be seen from the data from Table 2, compared with the classical network model, the multi-scale feature bilinear fusion network performs very well. This paper proposes MSFBF-Net, and then obtains two variant networks based on this network model, which are the lightweight MSFBF-Net(Lightweight MSFBF-Net) and the lightweight MSFBF-Net combined with multiple attention mechanism(Lightweight MSFBF-Net + MAM). The recognition accuracy of MSFBF-Net on two public databases is as high as 99.59% and 99.53%. Single image recognition time is only about 3 ms. Both model parameters and computational complexity have certain advantages. After a comprehensive comparison, it is concluded that the multi-scale feature bilinear fusion network is the most suitable for the finger vein recognition task.

Evaluation indicators.
Ablation experiment. In order to verify the correctness of ideas and methods and the effectiveness of each step, it is necessary to conduct ablation experiments. The algorithm is evaluated from the following four aspects: recognition accuracy, model parameters, computational complexity, and single image recognition time. Next, FV-USM data are used for analysis, and the experimental results are shown in Table 3. Although the multi-scale feature bilinear fusion network model has obtained a good recognition accuracy, the parameters and computational complexity are very large. After the network model is lightweight, a lot of computational complexity is reduced, and the recognition accuracy decreased from 99.59 to 99.29%, which is consistent with our expectation. The decrease in recognition accuracy is due to the use of mixed depthwise separable convolutions in the network. The depthwise separable convolution destroys the correlation between channels and spaces, resulting in a decrease in recognition accuracy. Finally, the multiple attention mechanism is embedded into the lightweight network model. It can be seen that the accuracy rate has increased from 99.29% to 99.90%, and the amount of computation has hardly increased, still 0.043GFlops. It is very convenient for the model to be deployed on hardware devices later. Comparing the lightweight model with the novel lightweight network model, we find that the method in this paper maintains high recognition accuracy, low computational complexity and less recognition time, while the amount of model parameters is acceptable. This is because the model parameters are composed of convolution kernel parameters and weight parameters. There are few convolution layers in our network model, so it is not obvious to reduce the parameters of the lightweight model. But most of the computation is in convolutional layers, which reduces Flops drastically from 0.21 to 0.043 GFlops. The recognition time of a single image of the final model is only 4 ms. This is consistent with the theoretically expected results, demonstrating the effectiveness of the algorithm at each step.
Compared with the traditional algorithm. Table 4 shows the experimental results on FV-USM. It can be seen that the recognition accuracy of the lightweight multi-scale feature bilinear fusion network combined with multiple attention mechanism is better than that of the excellent traditional finger vein recognition algorithm. In addition, when the network model is used for finger vein recognition, the finger vein database image is directly used without preprocessing, which greatly reduces the time cost.  network combined with the multiple attention mechanism and the existing convolutional neural network are tested on two public databases, and the experimental results are shown in the Table 5. Among the excellent finger vein recognition algorithms in recent years, Ren et al. 's method has the best comprehensive performance, but our method has better performance in recognition accuracy and recognition time. Figure 5 shows the recognition curves of our method and some other network models on two public databases. It can be seen that our method has advantages in recognition accuracy. Figure 6 is the training loss curve and valid loss curve obtained by using the lightweight MSFBF-Net architecture with attention mechanism proposed in this paper to experiment with FV-USM and SDUMLA. It can be seen from the figure that the model gradually converges after 100 epochs. Figure 7 shows some poor-quality images with obvious noise and unclear vein lines in public databases. Other methods of identification always identify errors. The lightweight multi-scale feature bilinear fusion network combined with multiple attention mechanism is designed for the characteristics of finger veins, which can extract the multi-scale features with rich information, and simultaneously extract the channel, spatial, directional, and positional information to assist recognition.

Conclusion
Aiming at the problems of large models, many parameters, and long recognition time in existing finger vein recognition methods, a lightweight multi-scale feature bilinear fusion network combined with multiple attention mechanism was proposed. Firstly, the multi-scale feature bilinear fusion network extracts the global features and  www.nature.com/scientificreports/ local detail features of the finger vein image. After bilinear pooling, a second-order feature containing richer information is obtained. Then, the mixed depthwise separable convolution is used to lightweight the network, which greatly reduces the computational complexity of the network. Finally, a multiple attention mechanism is designed to make up for the shortcomings of the mixed depthwise separable convolution and strengthen the network model's ability to extract subtle features. The proposed method is tested on two public databases, and the recognition accuracy is up to 99.90% and 99.82%, which is 7.62% higher than other methods. After the lightweight processing of the network, the computational complexity is only one-fifth of the original, and the recognition time of a single image is only 4 ms. Comprehensive comparison, the effect of the method in this paper is significantly improved, which is very suitable for finger vein recognition tasks.

Data availability
The data that support the findings of this study are available from SDUMLA (https:// time. sdu. edu. cn/ kycg/ gksjk. htm) and FV-USM (http:// drfen di. com/ fv_ usm_ datab ase/) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the corresponding author upon reasonable request and with permission of official.  www.nature.com/scientificreports/